AITopics | music information retrieval

Collaborating Authors

music information retrieval

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Twenty-Five Years of MIR Research: Achievements, Practices, Evaluations, and Future Challenges

Peeters, Geoffroy, Rafii, Zafar, Fuentes, Magdalena, Duan, Zhiyao, Benetos, Emmanouil, Nam, Juhan, Mitsufuji, Yuki

arXiv.org Artificial IntelligenceNov-11-2025

In this paper, we trace the evolution of Music Information Retrieval (MIR) over the past 25 years. While MIR gathers all kinds of research related to music informatics, a large part of it focuses on signal processing techniques for music data, fostering a close relationship with the IEEE Audio and Acoustic Signal Processing Technical Commitee. In this paper, we reflect the main research achievements of MIR along the three EDICS related to music analysis, processing and generation. We then review a set of successful practices that fuel the rapid development of MIR research. One practice is the annual research benchmark, the Music Information Retrieval Evaluation eXchange, where participants compete on a set of research tasks. Another practice is the pursuit of reproducible and open research. The active engagement with industry research and products is another key factor for achieving large societal impacts and motivating younger generations of students to join the field. Last but not the least, the commitment to diversity, equity and inclusion ensures MIR to be a vibrant and open community where various ideas, methodologies, and career pathways collide. We finish by providing future challenges MIR will have to face.

information retrieval, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/ICASSP49660.2025.10888947

2511.07205

Country:

Europe (0.68)
North America > United States (0.29)

Genre: Research Report (0.50)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.90)

Add feedback

Pretrained Conformers for Audio Fingerprinting and Retrieval

Altwlkany, Kemal, Selmanovic, Elmedin, Delalic, Sead

arXiv.org Artificial IntelligenceSep-12-2025

Conformers have shown great results in speech processing due to their ability to capture both local and global interactions. In this work, we utilize a self-supervised contrastive learning framework to train conformer-based encoders that are capable of generating unique embeddings for small segments of audio, generalizing well to previously unseen data. We achieve state-of-the-art results for audio retrieval tasks while using only 3 seconds of audio to generate embeddings. Our models are almost completely immune to temporal misalignments and achieve state-of-the-art results in cases of other audio distortions such as noise, reverb or extreme temporal stretching. Code and models are made publicly available and the results are easy to reproduce as we train and test using popular and freely available datasets of different sizes.

artificial intelligence, distortion, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2508.11609

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.69)

Add feedback

MAIA: An Inpainting-Based Approach for Music Adversarial Attacks

Liu, Yuxuan, Zhang, Peihong, Sang, Rui, Li, Zhixin, Li, Shengchen

arXiv.org Artificial IntelligenceSep-8-2025

Music adversarial attacks have garnered significant interest in the field of Music Information Retrieval (MIR). In this paper, we present Music Adversarial Inpainting Attack (MAIA), a novel adversarial attack framework that supports both white-box and black-box attack scenarios. MAIA begins with an importance analysis to identify critical audio segments, which are then targeted for modification. Utilizing generative inpainting models, these segments are reconstructed with guidance from the output of the attacked model, ensuring subtle and effective adversarial perturbations. We evaluate MAIA on multiple MIR tasks, demonstrating high attack success rates in both white-box and black-box settings while maintaining minimal perceptual distortion. Additionally, subjective listening tests confirm the high audio fidelity of the adversarial samples. Our findings highlight vulnerabilities in current MIR systems and emphasize the need for more robust and secure models.

artificial intelligence, machine learning, perturbation, (18 more...)

arXiv.org Artificial Intelligence

2509.0498

Country: Europe > Austria (0.28)

Genre: Research Report > New Finding (0.48)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)
Information Technology > Security & Privacy (1.00)
Government > Military (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Layer-wise Investigation of Large-Scale Self-Supervised Music Representation Models

Zhou, Yizhi, Zhu, Haina, Chen, Hangting

arXiv.org Artificial IntelligenceMay-23-2025

Recently, pre-trained models for music information retrieval based on self-supervised learning (SSL) are becoming popular, showing success in various downstream tasks. However, there is limited research on the specific meanings of the encoded information and their applicability. Exploring these aspects can help us better understand their capabilities and limitations, leading to more effective use in downstream tasks. In this study, we analyze the advanced music representation model MusicFM and the newly emerged SSL model MuQ. We focus on three main aspects: (i) validating the advantages of SSL models across multiple downstream tasks, (ii) exploring the specialization of layer-wise information for different tasks, and (iii) comparing performance differences when selecting specific layers. Through this analysis, we reveal insights into the structure and potential applications of SSL models in music information retrieval.

artificial intelligence, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

2505.16306

Country: Asia > China (0.29)

Genre: Research Report (0.84)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)

Add feedback

Learning Music Audio Representations With Limited Data

Plachouras, Christos, Benetos, Emmanouil, Pauwels, Johan

arXiv.org Artificial IntelligenceMay-12-2025

--Large deep-learning models for music, including those focused on learning general-purpose music audio representations, are often assumed to require substantial training data to achieve high performance. If true, this would pose challenges in scenarios where audio data or annotations are scarce, such as for underrepresented music traditions, non-popular genres, and personalized music creation and listening. Understanding how these models behave in limited-data scenarios could be crucial for developing techniques to tackle them. In this work, we investigate the behavior of several music audio representation models under limited-data learning regimes. We consider music models with various architectures, training paradigms, and input durations, and train them on data collections ranging from 5 to 8,000 minutes long. We evaluate the learned representations on various music information retrieval tasks and analyze their robustness to noise. We show that, under certain conditions, representations from limited-data and even random models perform comparably to ones from large-dataset models, though handcrafted features outperform all learned representations in some tasks. Large models tackling audio-based Music Information Retrieval (MIR) tasks are commonly thought to require substantial amounts of training data [1], [2].

artificial intelligence, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/ICASSP49660.2025.10887766

2505.06042

Country:

Asia (0.93)
North America > United States (0.47)
Europe > Netherlands (0.28)
(3 more...)

Genre: Research Report (0.64)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

CrossMuSim: A Cross-Modal Framework for Music Similarity Retrieval with LLM-Powered Text Description Sourcing and Mining

Tsoi, Tristan, Deng, Jiajun, Ju, Yaolong, Weck, Benno, Kirchhoff, Holger, Lui, Simon

arXiv.org Artificial IntelligenceMar-29-2025

--Music similarity retrieval is fundamental for managing and exploring relevant content from large collections in streaming platforms. This paper presents a novel cross-modal contrastive learning framework that leverages the open-ended nature of text descriptions to guide music similarity modeling, addressing the limitations of traditional uni-modal approaches in capturing complex musical relationships. T o overcome the scarcity of high-quality text-music paired data, this paper introduces a dual-source data acquisition approach combining online scraping and LLM-based prompting, where carefully designed prompts leverage LLMs' comprehensive music knowledge to generate contextually rich descriptions. Extensive experiments demonstrate that the proposed framework achieves significant performance improvements over existing benchmarks through objective metrics, subjective evaluations, and real-world A/B testing on the Huawei Music streaming platform. Music similarity retrieval plays an important role in many music information retrieval (MIR) tasks, such as music recommendation [1], personalized playlist generation [2] and background music replacement in video editing [3], [4]. As digital music collections rapidly expand within streaming platforms, accurately identifying similarities between musical pieces has become critical for managing and exploring relevant content from such large collections efficiently.

large language model, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2503.23128

Country:

Asia > China > Hong Kong (0.05)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)

Genre: Research Report (0.50)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

HARP 2.0: Expanding Hosted, Asynchronous, Remote Processing for Deep Learning in the DAW

Benetatos, Christodoulos, Cwitkowitz, Frank, Pruyne, Nathan, Garcia, Hugo Flores, O'Reilly, Patrick, Duan, Zhiyao, Pardo, Bryan

arXiv.org Artificial IntelligenceMar-4-2025

HARP 2.0 brings deep learning models to digital audio workstation (DAW) software through hosted, asynchronous, remote processing, allowing users to route audio from a plug-in interface through any compatible Gradio endpoint to perform arbitrary transformations. HARP renders endpoint-defined controls and processed audio in-plugin, meaning users can explore a variety of cutting-edge deep learning models without ever leaving the DAW. In the 2.0 release we introduce support for MIDI-based models and audio/MIDI labeling models, provide a streamlined pyharp Python API for model developers, and implement numerous interface and stability improvements. Through this work, we hope to bridge the gap between model developers and creatives, improving access to deep learning models by seamlessly integrating them into DAW workflows.

deep learning model, endpoint, harp 2, (12 more...)

arXiv.org Artificial Intelligence

2503.02977

Country: North America > United States > California > San Francisco County > San Francisco (0.05)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Linguistic Analysis of Sinhala YouTube Comments on Sinhala Music Videos: A Dataset Study

De Mel, W. M. Yomal, de Silva, Nisansa

arXiv.org Artificial IntelligenceJan-28-2025

This research investigates the area of Music Information Retrieval (MIR) and Music Emotion Recognition (MER) in relation to Sinhala songs, an underexplored field in music studies. The purpose of this study is to analyze the behavior of Sinhala comments on YouTube Sinhala song videos using social media comments as primary data sources. These included comments from 27 YouTube videos containing 20 different Sinhala songs, which were carefully selected so that strict linguistic reliability would be maintained and relevancy ensured. This process led to a total of 93,116 comments being gathered upon which the dataset was refined further by advanced filtering methods and transliteration mechanisms resulting into 63,471 Sinhala comments. Additionally, 964 stop-words specific for the Sinhala language were algorithmically derived out of which 182 matched exactly with English stop-words from NLTK corpus once translated. Also, comparisons were made between general domain corpora in Sinhala against the YouTube Comment Corpus in Sinhala confirming latter as good representation of general domain. The meticulously curated data set as well as the derived stop-words form important resources for future research in the fields of MIR and MER, since they could be used and demonstrate that there are possibilities with computational techniques to solve complex musical experiences across varied cultural traditions

artificial intelligence, machine learning, social media, (18 more...)

arXiv.org Artificial Intelligence

2501.18633

Country:

North America > United States > New York (0.04)
Asia > Sri Lanka (0.04)

Genre: Research Report (1.00)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Cognitive Science > Emotion (0.67)

Add feedback

Hidden Echoes Survive Training in Audio To Audio Generative Instrument Models

Tralie, Christopher J., Amery, Matt, Douglas, Benjamin, Utz, Ian

arXiv.org Artificial IntelligenceDec-13-2024

In our work, though, we do not seek to influence As generative techniques pervade the audio domain, the behavior of the model so drastically, but rather to there has been increasing interest in tracing back through "tag" the data in such a way that the model reproduces these complicated models to understand how they draw the tag, similarly to how [10] watermark their training on their training data to synthesize new examples, both data for a diffusion image model. We are also inspired to ensure that they use properly licensed data and also to by the recent lawsuit by Getty Images against Stable elucidate their black box behavior. In this paper, we show Diffusion when it was discovered that the latter would that if imperceptible echoes are hidden in the training often reproduce the former's watermarks in its output data, a wide variety of audio to audio architectures (differentiable [29]. We would like to do something similar with audio, digital signal processing (DDSP), Realtime but to keep it imperceptible. Audio Variational autoEncoder (RAVE), and "Dance All of the above approaches use neural networks to Diffusion") will reproduce these echoes in their outputs.

artificial intelligence, echoe, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2412.10649

Country:

Europe > France > Île-de-France > Paris > Paris (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > China > Jiangsu Province > Xuzhou (0.04)

Genre: Research Report (0.50)

Industry:

Media > Music (0.95)
Leisure & Entertainment (0.95)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.87)

Add feedback

CHORDONOMICON: A Dataset of 666,000 Songs and their Chord Progressions

Kantarelis, Spyridon, Thomas, Konstantinos, Lyberatos, Vassilis, Dervakos, Edmund, Stamou, Giorgos

arXiv.org Artificial IntelligenceDec-10-2024

Chord progressions encapsulate important information about music, pertaining to its structure and conveyed emotions. They serve as the backbone of musical composition, and in many cases, they are the sole information required for a musician to play along and follow the music. Despite their importance, chord progressions as a data domain remain underexplored. There is a lack of large-scale datasets suitable for deep learning applications, and limited research exploring chord progressions as an input modality. In this work, we present Chordonomicon, a dataset of over 666,000 songs and their chord progressions, annotated with structural parts, genre, and release date - created by scraping various sources of user-generated progressions and associated metadata. We demonstrate the practical utility of the Chordonomicon dataset for classification and generation tasks, and discuss its potential to provide valuable insights to the research community. Chord progressions are unique in their ability to be represented in multiple formats (e.g. text, graph) and the wealth of information chords convey in given contexts, such as their harmonic function . These characteristics make the Chordonomicon an ideal testbed for exploring advanced machine learning techniques, including transformers, graph machine learning, and hybrid systems that combine knowledge representation and machine learning.

artificial intelligence, deep learning, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2410.22046

Country:

North America > United States > New York > New York County > New York City (0.05)
North America > Puerto Rico > Peñuelas > Peñuelas (0.04)
Europe > Spain > Andalusia > Málaga Province > Málaga (0.04)
(8 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback